Technical report¶

Frontiers¶

Frontiers runs a number of open access journals in several scientific fields. Authors can submit their articles for publication to one of these journals. However, in some cases the authors may not be aware of the journal that best matches the scope of their paper. If the wrong journal is chosen, it may result in delays or even rejection. To this end, we are developing a feature that suggests to the authors the three most relevant journals to their manuscript, to choose from.

You are tasked to build a text classifier for this feature that, given some input text, can recommend the most suitable Frontiers journals to it.

You have at your disposal a .jsonl file containing:

  • Article identifier
  • Body text
  • Frontiers journal name

for all articles published by Frontiers in January 2020. You can find it here: https://drive.google.com/file/d/1es3EX0MdDAeolwFl_K_fS3RP0JFRxE2U/view?usp=sharing

Remarks:

  • The solution should be coded in Python.
  • You can use any Python library you may find useful.
  • Together with the code you should also provide a report where you describe your approach and present the results.
  • You are particularly encouraged to discuss the choice of the evaluation metric(s) and how this translates to business value.
  • (last but not least) As you write code for this assignment, keep in mind that it will be reviewed (and in real life, put in production) by other colleagues. Clean code, a modular structure, python packaging, testability, explicit dependencies, documentation, are all things that can facilitate the team!

Please email your solution in .zip format to davide.fiocco@frontiersin.org and be prepared to discuss it in the next interview stage.

Summary¶

This report is divided into the following sections:

  • Introduction: In this section, I introduce the problem by providing references and context.
  • Data and evaluation metrics: In this section, I present an exploratory data analysis (EDA) of the given dataset, which provides useful insights for choosing the methods. I also introduce the evaluation metrics used to select the best method.
  • Methods: In this section, I describe the methods tested to build the recommendation system.
  • Results: Here, I show the results of each method on the defined evaluation metrics and compare them with a trivial baseline.
  • Conclusion: Finally, I choose the best method considering both run time and model performance, and justify the choice.
  • Deployment and application: In this section, I show how to easily deploy the model as a REST API and interact with it through a simple web app.

Introduction¶

The task is to develop an algorithm that, given a scientific paper (or a simple text/report), recommends the most suitable Frontiers journals. Several methodologies could be used to build such a recommendation system and classifier; the best choice depends strongly on the number of classes (Frontiers journals) to be predicted. A previous study (Meijer et al. Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF. 2021) already compared document embeddings based on TFIDF and word embeddings for classifying a huge dataset of scientific papers (70 million) into 30 thousand distinct journals or conferences.

Here, I develop several variations of document embedding using:

  • the full document text;
  • only a list of keywords extracted from the text (from 3 to 7).

On each of these text representations, I tested several embedding strategies:

  • TFIDF: In information retrieval, TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches for information retrieval, text mining, and user modeling. The TFIDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. TFIDF is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use TFIDF (source: Wikipedia).
  • Word2Vec: I used the pretrained spaCy word embeddings from the language model en_core_web_lg. I decided not to train my own word embeddings (Word2Vec, FastText, GloVe, etc.) because of the short time available for the assignment and the small size of the dataset.
  • SBERT: Sentence-BERT (Reimers et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. (2019)) is a modification of the pretrained BERT network that uses siamese network structures to derive sentence embeddings. While the architecture of BERT makes it unsuitable for large-scale semantic similarity search, SBERT provides semantically meaningful sentence embeddings that can be compared using cosine similarity.
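
To make the TFIDF weighting concrete, here is a minimal pure-Python sketch. It uses the textbook formulation tf × log(N / df); the toy corpus and the function name `tfidf` are illustrative assumptions and are not part of the project code. Library implementations usually add smoothing and normalization, so their numbers will differ from this sketch.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical, not taken from the dataset).
toy_corpus = [
    "immune cells fight infection",
    "plant cells store energy",
    "neural cells transmit signals",
]
toy_weights = tfidf(toy_corpus)
```

In this toy corpus the word "cells" appears in every document, so its weight is zero, while a word that appears in a single document (such as "immune") gets a positive weight.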

Data and evaluation metrics¶

This section is divided into:

  • Setup: The setup needed for the notebook to run correctly
  • EDA: An EDA of the dataset
  • Evaluation metrics: The presentation of the evaluation metrics

Setup¶

Simple import of all needed libraries

In [1]:
import os
import warnings

# Run the notebook from the project root (one level above this folder).
os.chdir(os.path.dirname(os.getcwd()))

import numpy as np
import plotly.express as px
from sklearn.manifold import TSNE

from src.utils.utils import load_data, IO
from src.preprocess.preprocess import filter_papers_min_sample, preprocess
from src.train.document_approach import create_embeddings_document
from src.train.train import (train_test,
                        train_embeddings_keyword_word2vec,
                        train_embeddings_document_word2vec,
                        train_embeddings_keyword_tfidf,
                        train_embeddings_document_tfidf,
                        train_embeddings_keyword_sbert,
                        train_embeddings_document_sbert
                        )
from src.evaluate.evaluate import (evaluate_document_word2vec,
                               evaluate_keyword_word2vec,
                               evaluate_keyword_tfidf,
                               evaluate_document_tfidf,
                               evaluate_keyword_sbert,
                               evaluate_document_sbert)

warnings.filterwarnings("ignore")

%matplotlib inline

EDA¶

The EDA covers:

  1. The number of scientific articles published for each Frontiers journal;
  2. The distribution of the length of the text;
  3. The definition of the train and test split;
  4. The definition of the preprocessing of the text:
    1. Normalization:
      1. Lowercase
      2. Remove punctuation
      3. Remove numbers
    2. Removing a predefined list of useless words
    3. Removing stopwords
  5. The definition of the keywords and how to extract them.
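
The preprocessing steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the function name `preprocess_text`, the tiny stopword set, and the list of custom words are hypothetical, and the project's own `preprocess` function may differ (the setup logs suggest it relies on NLTK resources).

```python
import re
import string

# Tiny illustrative stopword list (assumption; the real pipeline likely
# uses a full stopword corpus).
STOPWORDS = {"the", "of", "and", "in", "a", "an", "for", "to", "with"}
# Hypothetical predefined words to drop (the actual list is not shown).
CUSTOM_WORDS = {"et", "al", "fig", "figure"}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_text(text):
    """Lowercase, strip punctuation and digits, then drop unwanted words."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS and t not in CUSTOM_WORDS]
    return " ".join(tokens)


example = preprocess_text("Dynamics and Outcome of Macrophage Interaction in 3 Hosts (Fig. 2)")
print(example)  # dynamics outcome macrophage interaction hosts
```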

In [2]:
# Load of the dataset
df = load_data()
In [3]:
# Print of one example
df.head(5)
Out[3]:
id text journal
0 465950 \n Sleep Characteristics and Influencing Facto... Frontiers in Medicine
1 483526 A Hybrid Approach for Modeling Type 2 Diabetes... Frontiers in Genetics
2 482500 \n Relationship Between SES and Academic Achie... Frontiers in Psychology
3 437333 Environmental Health Research in Africa: Impor... Frontiers in Genetics
4 486515 \n 3,5-T2—A Janus-Faced Thyroid Hormone Metabo... Frontiers in Endocrinology
In [4]:
# 1. The number of scientific articles published for each Frontiers journal;
documents_per_journal, df_subset = filter_papers_min_sample(df)
documents_per_journal = documents_per_journal.reset_index().rename(columns={0:"count"})
fig = px.bar(documents_per_journal, x='journal', y='count')
fig.show()

Conclusion 1: The number of published articles per journal is strongly unbalanced. To evaluate the methods presented in the next section, I filter the original dataset and keep only the journals with at least 2 publications.

In [5]:
# 2. The distribution of the length of the text;
df_subset["len_text"] = df_subset["text"].str.len()
fig = px.histogram(df_subset, x="len_text", nbins=100)
fig.show()

Conclusion 2: The distribution does not look normal because of its tail (it looks closer to a binomial distribution). However, the sample is too small to draw a firm conclusion.

In [6]:
# 3. The definition of the train and test split;
df_train, df_test = train_test(df_subset)

Conclusion 3: The test size is set to 33% of the data.

In [7]:
# 4. Preprocessing of the train and test sets (loaded from the saved intermediate files)
df_train = IO(filename="df_train_preprocessed", folder="02_intermediate", format_="pickle").load()
df_test = IO(filename="df_test_preprocessed", folder="02_intermediate", format_="pickle").load()
df_train[["id", "text", "preprocessed_text", "journal"]].head(5)
Out[7]:
id text preprocessed_text journal
0 494570 \n \n Low Testosterone in Adolescents & Young ... testosterone adolescents young adults jordan c... Frontiers in Endocrinology
1 483146 \n Dynamics and Outcome of Macrophage Interact... dynamics outcome macrophage interaction salmon... Frontiers in Cellular and Infection Microbiology
2 493402 \n Oral Treatments With Probiotics and Live S... oral treatments probiotics live salmonella vac... Frontiers in Microbiology
3 475909 \n A Systematic, Regional Assessment of High M... systematic regional assessment high mountain a... Frontiers in Earth Science
4 508059 \n Hot Water Extracted and Non-extracted Willo... water extracted extracted willow biomass stora... Frontiers in Energy Research
In [8]:
# 5. The definition of the keywords and how to extract them;
df_train[["id","text","keywords","journal"]].head(5)
Out[8]:
id text keywords journal
0 494570 \n \n Low Testosterone in Adolescents & Young ... [testosterone, obesity, diabetes, adolescence,... Frontiers in Endocrinology
1 483146 \n Dynamics and Outcome of Macrophage Interact... [S . Typhimurium, S . Gallinarum, S . Dublin, ... Frontiers in Cellular and Infection Microbiology
2 493402 \n Oral Treatments With Probiotics and Live S... [probiotics, poultry, intestine, neurochemical... Frontiers in Microbiology
3 475909 \n A Systematic, Regional Assessment of High M... [digital elevation model (DEM), Himalaya, geod... Frontiers in Earth Science
4 508059 \n Hot Water Extracted and Non-extracted Willo... [willow biomass, hot water extraction, bioener... Frontiers in Energy Research

Conclusion 5: Each scientific paper is published with a list of keywords that identify the article. I define a simple rule to extract the keywords from the text when they are located between the words "Keywords:" and "Citation:". When this rule-based extractor fails, I apply the TextRank algorithm to extract the top 5 keywords from the text.
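
The rule-based part of the keyword extraction can be sketched with a regular expression. The function name `extract_keywords`, the comma separator, and the sample string are assumptions made for illustration; the TextRank fallback is not implemented here and is only signalled by an empty result.

```python
import re

# Capture everything between "Keywords:" and "Citation:", across line breaks.
KEYWORDS_PATTERN = re.compile(r"Keywords:(.*?)Citation:", re.DOTALL)


def extract_keywords(text):
    """Return the author keywords, or an empty list if the markers are absent."""
    match = KEYWORDS_PATTERN.search(text)
    if match is None:
        return []  # the caller would fall back to TextRank in this case
    # Assumption: keywords are separated by commas.
    return [kw.strip() for kw in match.group(1).split(",") if kw.strip()]


# Hypothetical text fragment, not taken from the dataset.
sample = "... Keywords: willow biomass, hot water extraction, bioenergy\nCitation: ..."
print(extract_keywords(sample))  # ['willow biomass', 'hot water extraction', 'bioenergy']
```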

Evaluation metrics¶

To analyze and compare the results I used five metrics:

  • Average accuracy: I consider a prediction correct when one of the three recommended journals is the journal where the paper was published. Conversely, when that journal is not among the three recommended journals, I consider the prediction wrong. The accuracy is not weighted, so the support of each class is not taken into account.
  • Mean Reciprocal Rank: The mean reciprocal rank is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness (source: Wikipedia).
  • Precision, Recall, and F1-Score: I calculate precision, recall, and F1-score for each class. I count a true positive when the true journal is inside the list of recommended journals. If it is not, I take the journal in the first position of the ranking as the (wrong) prediction. With this convention, I can compute precision, recall, and F1-score for each class.
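
As a minimal sketch of the first two metrics, the snippet below computes the top-3 accuracy and the mean reciprocal rank on a few hypothetical ranked recommendations. The function names and the toy data are illustrative, and the convention that a miss contributes 0 to the reciprocal rank is an assumption about the project's implementation.

```python
def top_k_accuracy(recommendations, true_journals, k=3):
    """Share of papers whose true journal is among the first k recommendations."""
    hits = sum(true in recs[:k] for recs, true in zip(recommendations, true_journals))
    return hits / len(true_journals)


def mean_reciprocal_rank(recommendations, true_journals):
    """Average of 1/rank of the true journal; a miss contributes 0 (assumption)."""
    total = 0.0
    for recs, true in zip(recommendations, true_journals):
        if true in recs:
            total += 1.0 / (recs.index(true) + 1)
    return total / len(true_journals)


# Hypothetical ranked recommendations for three papers.
recs = [
    ["Frontiers in Microbiology", "Frontiers in Immunology", "Frontiers in Genetics"],
    ["Frontiers in Psychology", "Frontiers in Psychiatry", "Frontiers in Neurology"],
    ["Frontiers in Oncology", "Frontiers in Medicine", "Frontiers in Surgery"],
]
truth = ["Frontiers in Immunology", "Frontiers in Psychology", "Frontiers in Physics"]

accuracy = top_k_accuracy(recs, truth)
mrr = mean_reciprocal_rank(recs, truth)
```

Here two of the three papers are hits (accuracy 2/3), and the reciprocal ranks are 1/2, 1 and 0, giving an MRR of 0.5. The MRR rewards placing the right journal first, which the top-3 accuracy alone does not capture.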

Furthermore, I define a baseline model. This should be the simplest possible model. As the classes are strongly unbalanced, I define a model that always predicts the three journals with the highest number of publications. The results show:

In [9]:
baseline_model_performance = IO(filename="evaluation_baseline",folder="05_report",format_="json").load()
print(f'Average accuracy: {baseline_model_performance["accuracy_total"]}; MRR: {baseline_model_performance["mean_reciprocal_rank"]}')
print(baseline_model_performance["precision_recall_f1score"])
Average accuracy: 0.26; MRR: 0.16
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       0.00      0.00      0.00         6
                 Frontiers in Aging Neuroscience       0.00      0.00      0.00        10
 Frontiers in Applied Mathematics and Statistics       0.00      0.00      0.00         2
            Frontiers in Artificial Intelligence       0.00      0.00      0.00         1
       Frontiers in Astronomy and Space Sciences       0.00      0.00      0.00         1
            Frontiers in Behavioral Neuroscience       0.00      0.00      0.00         7
                           Frontiers in Big Data       0.00      0.00      0.00         1
   Frontiers in Bioengineering and Biotechnology       0.00      0.00      0.00        18
                         Frontiers in Blockchain       0.00      0.00      0.00         2
                  Frontiers in Built Environment       0.00      0.00      0.00         3
            Frontiers in Cardiovascular Medicine       0.00      0.00      0.00         4
     Frontiers in Cell and Developmental Biology       0.00      0.00      0.00        15
              Frontiers in Cellular Neuroscience       0.00      0.00      0.00         9
Frontiers in Cellular and Infection Microbiology       0.00      0.00      0.00        15
                          Frontiers in Chemistry       0.00      0.00      0.00        23
                      Frontiers in Communication       0.00      0.00      0.00         1
         Frontiers in Computational Neuroscience       0.00      0.00      0.00         5
                      Frontiers in Earth Science       0.00      0.00      0.00         9
              Frontiers in Ecology and Evolution       0.00      0.00      0.00        11
                          Frontiers in Education       0.00      0.00      0.00         4
                      Frontiers in Endocrinology       0.00      0.00      0.00        17
                    Frontiers in Energy Research       0.00      0.00      0.00         5
              Frontiers in Environmental Science       0.00      0.00      0.00         3
          Frontiers in Forests and Global Change       0.00      0.00      0.00         3
                           Frontiers in Genetics       0.00      0.00      0.00        33
                 Frontiers in Human Neuroscience       0.00      0.00      0.00         8
                         Frontiers in Immunology       1.00      1.00      1.00        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.00      0.00      0.00        21
                          Frontiers in Materials       0.00      0.00      0.00         6
             Frontiers in Mechanical Engineering       0.00      0.00      0.00         2
                           Frontiers in Medicine       0.00      0.00      0.00        14
                       Frontiers in Microbiology       0.11      1.00      0.20        79
              Frontiers in Molecular Biosciences       0.00      0.00      0.00         6
             Frontiers in Molecular Neuroscience       0.00      0.00      0.00         8
                    Frontiers in Neural Circuits       0.00      0.00      0.00         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.00      0.00      0.00        25
                      Frontiers in Neurorobotics       0.00      0.00      0.00         3
                       Frontiers in Neuroscience       0.00      0.00      0.00        29
                          Frontiers in Nutrition       0.00      0.00      0.00         3
                           Frontiers in Oncology       0.00      0.00      0.00        43
                         Frontiers in Pediatrics       0.00      0.00      0.00        14
                       Frontiers in Pharmacology       0.00      0.00      0.00        57
                            Frontiers in Physics       0.00      0.00      0.00        11
                         Frontiers in Physiology       0.00      0.00      0.00        35
                      Frontiers in Plant Science       0.00      0.00      0.00        48
                         Frontiers in Psychiatry       0.00      0.00      0.00        28
                         Frontiers in Psychology       1.00      1.00      1.00        73
                      Frontiers in Public Health       0.00      0.00      0.00        10
                    Frontiers in Robotics and AI       0.00      0.00      0.00         5
                          Frontiers in Sociology       0.00      0.00      0.00         1
           Frontiers in Sports and Active Living       0.00      0.00      0.00         3
                            Frontiers in Surgery       0.00      0.00      0.00         3
           Frontiers in Sustainable Food Systems       0.00      0.00      0.00         4
              Frontiers in Synaptic Neuroscience       0.00      0.00      0.00         1
               Frontiers in Systems Neuroscience       0.00      0.00      0.00         3
                 Frontiers in Veterinary Science       0.00      0.00      0.00        15

                                        accuracy                           0.26       833
                                       macro avg       0.04      0.05      0.04       833
                                    weighted avg       0.17      0.26      0.18       833
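
As a sanity check, the baseline figures can be reproduced by hand from the support column of the report above. The sketch assumes that the baseline ranks the most frequent journal first and that misses contribute 0 to the MRR; the relative order of the second and third journal is also an assumption (either order rounds to the same MRR).

```python
# Test-set support of the three journals the baseline always recommends,
# read from the classification report above.
support = {
    "Frontiers in Microbiology": 79,
    "Frontiers in Psychology": 73,
    "Frontiers in Immunology": 61,
}
n_test = 833

# Assumed ranking: most frequent journal first.
ranking = ["Frontiers in Microbiology", "Frontiers in Psychology", "Frontiers in Immunology"]

hits = sum(support.values())
accuracy = hits / n_test
mrr = sum(support[j] / rank for rank, j in enumerate(ranking, start=1)) / n_test
# Every miss is attributed to the first-ranked journal, which lowers its precision.
precision_first = support[ranking[0]] / (support[ranking[0]] + (n_test - hits))

print(round(accuracy, 2), round(mrr, 2), round(precision_first, 2))  # 0.26 0.16 0.11
```

This also explains why Frontiers in Microbiology has a precision of 0.11 while the other two recommended journals have a precision of 1.00: all wrong predictions are counted against the journal in the first position.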

Methods¶

The tested methods fall into two groups, depending on how the document is represented:

  1. Using all the text of the document;
  2. Using only a list of keywords that represent the text.

For both representations, I use three methods to create a document embedding and then a journal embedding:

  1. TFIDF
  2. spaCy word embeddings (Word2Vec)
  3. SBERT

All text representation¶

Considering the text of the documents in the training set, I compute:

  1. A TFIDF vector for each document (fitted on the whole corpus with a maximum of 10,000 features);
  2. Document embeddings using the spaCy pre-trained Word2Vec model;
  3. SBERT embeddings. To obtain them, I split each document into sentences and compute the SBERT embedding of each sentence. I then average all the sentence embeddings to obtain a document embedding.

After this first step, I have an embedding for each document. Finally, I average the embeddings of all the documents published in the same Frontiers journal to obtain a journal embedding.
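
The averaging step, and one plausible way to turn journal embeddings into recommendations, can be sketched as below. The helper names, the toy 2-D vectors, and the journal labels "A", "B", "C" are hypothetical, and ranking journals by cosine similarity to the document embedding is an assumption about how the recommendation is produced.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def journal_embeddings(doc_embeddings, journals):
    """Average the document embeddings of each journal."""
    grouped = defaultdict(list)
    for emb, journal in zip(doc_embeddings, journals):
        grouped[journal].append(emb)
    return {journal: mean_vector(embs) for journal, embs in grouped.items()}


def recommend(doc_embedding, centroids, k=3):
    """Rank journals by cosine similarity to the document (assumed ranking rule)."""
    ranked = sorted(centroids, key=lambda j: cosine(doc_embedding, centroids[j]), reverse=True)
    return ranked[:k]


# Toy 2-D embeddings with hypothetical values.
train_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.5, 0.5]]
train_journals = ["A", "A", "B", "C"]
centroids = journal_embeddings(train_embeddings, train_journals)
top = recommend([0.9, 0.1], centroids, k=2)
print(top)  # ['A', 'C']
```

Averaging keeps one vector per journal, so a recommendation only needs as many similarity computations as there are journals, regardless of the size of the training set.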

List of keywords representation¶

Considering the list of keywords extracted from each document, I define the document embedding as the average of the embeddings of its keywords. The embedding of each keyword is computed using:

  1. A TFIDF vector (considering a maximum of 10,000 features);
  2. Word embeddings from the spaCy pre-trained Word2Vec model;
  3. An SBERT embedding for each keyword.

After these steps, I have an embedding for each document. Finally, I average the embeddings of all the documents published in the same Frontiers journal to obtain a journal embedding.
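
For the keyword representation, the document embedding is the mean of its keyword vectors. The sketch below uses a hypothetical lookup table in place of a real embedding model; skipping unknown keywords and falling back to a zero vector are assumptions, not behavior confirmed by the project code.

```python
# Hypothetical keyword -> vector lookup standing in for a real embedding model.
KEYWORD_VECTORS = {
    "probiotics": [0.2, 0.8, 0.0],
    "poultry": [0.4, 0.4, 0.2],
    "intestine": [0.0, 0.6, 0.4],
}


def keywords_to_embedding(keywords, vectors, dim=3):
    """Average the vectors of known keywords; unknown keywords are skipped."""
    known = [vectors[kw] for kw in keywords if kw in vectors]
    if not known:
        return [0.0] * dim  # no usable keyword: fall back to a zero vector
    return [sum(values) / len(known) for values in zip(*known)]


doc_embedding = keywords_to_embedding(
    ["probiotics", "poultry", "intestine", "unknown term"], KEYWORD_VECTORS
)
```

With these toy values the resulting embedding is approximately [0.2, 0.6, 0.2], since the unknown keyword does not take part in the average.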

Results¶

Here I show the performance of the tested models.

All text representation¶

TFIDF¶

For TFIDF, I show the document embeddings of the training set in 2D. To reduce the dimensionality of the embeddings I use t-SNE. The figure shows that papers published in the same journal are close to each other.

In [10]:
df_train_embeddings = create_embeddings_document(df_train, "tfidf")
In [11]:
def get_tsne(df):
    """Project the document embeddings to 2D with t-SNE and plot them by journal."""
    tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
    tsne_result = tsne.fit_transform(np.stack(df["embeddings"], axis=0))
    # Work on a copy so the new columns are not written into a view of df.
    df_tsne = df[["journal"]].copy()
    df_tsne["tsne-2d-one"] = tsne_result[:, 0]
    df_tsne["tsne-2d-two"] = tsne_result[:, 1]
    fig = px.scatter(df_tsne, x="tsne-2d-one", y="tsne-2d-two", color="journal")
    fig.show()

get_tsne(df_train_embeddings)
In [12]:
model = IO(filename="evaluation_document_tfidf",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.81; MRR: 0.68
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       1.00      1.00      1.00         6
                 Frontiers in Aging Neuroscience       0.67      0.80      0.73        10
 Frontiers in Applied Mathematics and Statistics       0.00      0.00      0.00         2
            Frontiers in Artificial Intelligence       0.00      0.00      0.00         1
       Frontiers in Astronomy and Space Sciences       0.00      0.00      0.00         1
            Frontiers in Behavioral Neuroscience       0.57      0.57      0.57         7
                           Frontiers in Big Data       0.00      0.00      0.00         1
   Frontiers in Bioengineering and Biotechnology       0.76      0.72      0.74        18
                         Frontiers in Blockchain       0.00      0.00      0.00         2
                  Frontiers in Built Environment       0.00      0.00      0.00         3
            Frontiers in Cardiovascular Medicine       0.75      0.75      0.75         4
     Frontiers in Cell and Developmental Biology       0.55      0.80      0.65        15
              Frontiers in Cellular Neuroscience       0.43      0.33      0.38         9
Frontiers in Cellular and Infection Microbiology       0.82      0.60      0.69        15
                          Frontiers in Chemistry       0.92      0.96      0.94        23
                      Frontiers in Communication       0.00      0.00      0.00         1
         Frontiers in Computational Neuroscience       0.80      0.80      0.80         5
                      Frontiers in Earth Science       0.88      0.78      0.82         9
              Frontiers in Ecology and Evolution       0.71      0.91      0.80        11
                          Frontiers in Education       1.00      0.75      0.86         4
                      Frontiers in Endocrinology       1.00      0.53      0.69        17
                    Frontiers in Energy Research       1.00      1.00      1.00         5
              Frontiers in Environmental Science       0.67      0.67      0.67         3
          Frontiers in Forests and Global Change       1.00      1.00      1.00         3
                           Frontiers in Genetics       0.78      0.88      0.83        33
                 Frontiers in Human Neuroscience       0.50      0.38      0.43         8
                         Frontiers in Immunology       0.90      0.89      0.89        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.90      0.86      0.88        21
                          Frontiers in Materials       1.00      1.00      1.00         6
             Frontiers in Mechanical Engineering       0.00      0.00      0.00         2
                           Frontiers in Medicine       0.55      0.86      0.67        14
                       Frontiers in Microbiology       0.91      0.92      0.92        79
              Frontiers in Molecular Biosciences       0.67      0.33      0.44         6
             Frontiers in Molecular Neuroscience       1.00      0.50      0.67         8
                    Frontiers in Neural Circuits       0.00      0.00      0.00         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.78      0.84      0.81        25
                      Frontiers in Neurorobotics       1.00      0.67      0.80         3
                       Frontiers in Neuroscience       0.56      0.76      0.65        29
                          Frontiers in Nutrition       0.00      0.00      0.00         3
                           Frontiers in Oncology       0.86      0.98      0.91        43
                         Frontiers in Pediatrics       0.73      0.79      0.76        14
                       Frontiers in Pharmacology       0.78      0.91      0.84        57
                            Frontiers in Physics       0.47      0.82      0.60        11
                         Frontiers in Physiology       0.88      0.66      0.75        35
                      Frontiers in Plant Science       1.00      0.90      0.95        48
                         Frontiers in Psychiatry       1.00      1.00      1.00        28
                         Frontiers in Psychology       0.79      0.96      0.86        73
                      Frontiers in Public Health       1.00      0.70      0.82        10
                    Frontiers in Robotics and AI       0.80      0.80      0.80         5
                          Frontiers in Sociology       0.00      0.00      0.00         1
           Frontiers in Sports and Active Living       1.00      0.33      0.50         3
                            Frontiers in Surgery       0.00      0.00      0.00         3
           Frontiers in Sustainable Food Systems       1.00      0.25      0.40         4
              Frontiers in Synaptic Neuroscience       0.00      0.00      0.00         1
               Frontiers in Systems Neuroscience       0.00      0.00      0.00         3
                 Frontiers in Veterinary Science       0.93      0.93      0.93        15

                                        accuracy                           0.81       833
                                       macro avg       0.58      0.54      0.55       833
                                    weighted avg       0.80      0.81      0.79       833

Word2Vec¶

In [13]:
model = IO(filename="evaluation_document_word2vec",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.66; MRR: 0.52
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       1.00      1.00      1.00         6
                 Frontiers in Aging Neuroscience       0.44      0.40      0.42        10
 Frontiers in Applied Mathematics and Statistics       0.40      1.00      0.57         2
            Frontiers in Artificial Intelligence       0.50      1.00      0.67         1
       Frontiers in Astronomy and Space Sciences       1.00      1.00      1.00         1
            Frontiers in Behavioral Neuroscience       0.62      0.71      0.67         7
                           Frontiers in Big Data       0.00      0.00      0.00         1
   Frontiers in Bioengineering and Biotechnology       0.73      0.44      0.55        18
                         Frontiers in Blockchain       1.00      0.50      0.67         2
                  Frontiers in Built Environment       0.38      1.00      0.55         3
            Frontiers in Cardiovascular Medicine       0.29      0.50      0.36         4
     Frontiers in Cell and Developmental Biology       0.34      0.87      0.49        15
              Frontiers in Cellular Neuroscience       0.57      0.89      0.70         9
Frontiers in Cellular and Infection Microbiology       0.75      0.80      0.77        15
                          Frontiers in Chemistry       0.82      0.78      0.80        23
                      Frontiers in Communication       1.00      1.00      1.00         1
         Frontiers in Computational Neuroscience       0.44      0.80      0.57         5
                      Frontiers in Earth Science       0.82      1.00      0.90         9
              Frontiers in Ecology and Evolution       0.60      0.82      0.69        11
                          Frontiers in Education       0.30      0.75      0.43         4
                      Frontiers in Endocrinology       0.60      0.71      0.65        17
                    Frontiers in Energy Research       0.62      1.00      0.77         5
              Frontiers in Environmental Science       0.20      0.33      0.25         3
          Frontiers in Forests and Global Change       0.50      1.00      0.67         3
                           Frontiers in Genetics       0.72      0.55      0.62        33
                 Frontiers in Human Neuroscience       0.35      0.88      0.50         8
                         Frontiers in Immunology       0.90      0.85      0.87        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.85      0.52      0.65        21
                          Frontiers in Materials       0.86      1.00      0.92         6
             Frontiers in Mechanical Engineering       0.50      1.00      0.67         2
                           Frontiers in Medicine       0.38      0.43      0.40        14
                       Frontiers in Microbiology       0.92      0.70      0.79        79
              Frontiers in Molecular Biosciences       0.36      0.67      0.47         6
             Frontiers in Molecular Neuroscience       0.56      0.62      0.59         8
                    Frontiers in Neural Circuits       0.20      0.50      0.29         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.67      0.64      0.65        25
                      Frontiers in Neurorobotics       0.15      0.67      0.25         3
                       Frontiers in Neuroscience       0.58      0.24      0.34        29
                          Frontiers in Nutrition       0.00      0.00      0.00         3
                           Frontiers in Oncology       0.84      0.88      0.86        43
                         Frontiers in Pediatrics       0.44      0.50      0.47        14
                       Frontiers in Pharmacology       0.92      0.63      0.75        57
                            Frontiers in Physics       0.80      0.36      0.50        11
                         Frontiers in Physiology       0.71      0.29      0.41        35
                      Frontiers in Plant Science       0.97      0.71      0.82        48
                         Frontiers in Psychiatry       0.65      0.54      0.59        28
                         Frontiers in Psychology       0.93      0.78      0.85        73
                      Frontiers in Public Health       0.57      0.40      0.47        10
                    Frontiers in Robotics and AI       0.67      0.80      0.73         5
                          Frontiers in Sociology       0.33      1.00      0.50         1
           Frontiers in Sports and Active Living       0.50      0.67      0.57         3
                            Frontiers in Surgery       0.00      0.00      0.00         3
           Frontiers in Sustainable Food Systems       0.50      0.50      0.50         4
              Frontiers in Synaptic Neuroscience       0.25      1.00      0.40         1
               Frontiers in Systems Neuroscience       0.50      0.33      0.40         3
                 Frontiers in Veterinary Science       0.69      0.73      0.71        15

                                        accuracy                           0.66       833
                                       macro avg       0.55      0.64      0.55       833
                                    weighted avg       0.74      0.66      0.68       833

SBERT

In [14]:
model = IO(filename="evaluation_document_sbert",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.83; MRR: 0.71
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       1.00      0.33      0.50         6
                 Frontiers in Aging Neuroscience       0.70      0.70      0.70        10
 Frontiers in Applied Mathematics and Statistics       1.00      1.00      1.00         2
            Frontiers in Artificial Intelligence       1.00      1.00      1.00         1
       Frontiers in Astronomy and Space Sciences       1.00      1.00      1.00         1
            Frontiers in Behavioral Neuroscience       0.56      0.71      0.63         7
                           Frontiers in Big Data       0.50      1.00      0.67         1
   Frontiers in Bioengineering and Biotechnology       0.80      0.67      0.73        18
                         Frontiers in Blockchain       1.00      1.00      1.00         2
                  Frontiers in Built Environment       1.00      1.00      1.00         3
            Frontiers in Cardiovascular Medicine       0.57      1.00      0.73         4
     Frontiers in Cell and Developmental Biology       0.56      1.00      0.71        15
              Frontiers in Cellular Neuroscience       0.62      0.89      0.73         9
Frontiers in Cellular and Infection Microbiology       0.67      0.93      0.78        15
                          Frontiers in Chemistry       0.95      0.87      0.91        23
                      Frontiers in Communication       1.00      1.00      1.00         1
         Frontiers in Computational Neuroscience       0.60      0.60      0.60         5
                      Frontiers in Earth Science       0.90      1.00      0.95         9
              Frontiers in Ecology and Evolution       0.85      1.00      0.92        11
                          Frontiers in Education       1.00      0.75      0.86         4
                      Frontiers in Endocrinology       0.94      0.94      0.94        17
                    Frontiers in Energy Research       0.83      1.00      0.91         5
              Frontiers in Environmental Science       1.00      1.00      1.00         3
          Frontiers in Forests and Global Change       1.00      1.00      1.00         3
                           Frontiers in Genetics       0.78      0.85      0.81        33
                 Frontiers in Human Neuroscience       0.40      0.75      0.52         8
                         Frontiers in Immunology       0.97      0.93      0.95        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.90      0.86      0.88        21
                          Frontiers in Materials       1.00      1.00      1.00         6
             Frontiers in Mechanical Engineering       1.00      1.00      1.00         2
                           Frontiers in Medicine       0.67      0.71      0.69        14
                       Frontiers in Microbiology       0.94      0.85      0.89        79
              Frontiers in Molecular Biosciences       0.50      0.83      0.62         6
             Frontiers in Molecular Neuroscience       0.40      1.00      0.57         8
                    Frontiers in Neural Circuits       0.20      0.50      0.29         2
                       Frontiers in Neuroanatomy       1.00      1.00      1.00         1
                   Frontiers in Neuroinformatics       0.33      1.00      0.50         1
                          Frontiers in Neurology       0.88      0.88      0.88        25
                      Frontiers in Neurorobotics       0.75      1.00      0.86         3
                       Frontiers in Neuroscience       0.62      0.28      0.38        29
                          Frontiers in Nutrition       0.50      0.67      0.57         3
                           Frontiers in Oncology       0.91      1.00      0.96        43
                         Frontiers in Pediatrics       0.85      0.79      0.81        14
                       Frontiers in Pharmacology       0.92      0.77      0.84        57
                            Frontiers in Physics       1.00      0.82      0.90        11
                         Frontiers in Physiology       0.81      0.49      0.61        35
                      Frontiers in Plant Science       0.98      0.92      0.95        48
                         Frontiers in Psychiatry       1.00      0.93      0.96        28
                         Frontiers in Psychology       0.92      0.90      0.91        73
                      Frontiers in Public Health       0.89      0.80      0.84        10
                    Frontiers in Robotics and AI       1.00      0.40      0.57         5
                          Frontiers in Sociology       1.00      1.00      1.00         1
           Frontiers in Sports and Active Living       0.60      1.00      0.75         3
                            Frontiers in Surgery       0.67      0.67      0.67         3
           Frontiers in Sustainable Food Systems       1.00      1.00      1.00         4
              Frontiers in Synaptic Neuroscience       1.00      1.00      1.00         1
               Frontiers in Systems Neuroscience       0.67      0.67      0.67         3
                 Frontiers in Veterinary Science       0.94      1.00      0.97        15

                                        accuracy                           0.83       833
                                       macro avg       0.80      0.84      0.80       833
                                    weighted avg       0.86      0.83      0.83       833

List of keywords representation

TFIDF

In [15]:
model = IO(filename="evaluation_keywords_tfidf",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.56; MRR: 0.42
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       0.00      0.00      0.00         6
                 Frontiers in Aging Neuroscience       0.35      0.70      0.47        10
 Frontiers in Applied Mathematics and Statistics       0.00      0.00      0.00         2
            Frontiers in Artificial Intelligence       0.00      0.00      0.00         1
       Frontiers in Astronomy and Space Sciences       0.00      0.00      0.00         1
            Frontiers in Behavioral Neuroscience       0.20      0.29      0.24         7
                           Frontiers in Big Data       0.00      0.00      0.00         1
   Frontiers in Bioengineering and Biotechnology       0.39      0.39      0.39        18
                         Frontiers in Blockchain       0.40      1.00      0.57         2
                  Frontiers in Built Environment       0.00      0.00      0.00         3
            Frontiers in Cardiovascular Medicine       0.10      0.25      0.14         4
     Frontiers in Cell and Developmental Biology       0.60      0.40      0.48        15
              Frontiers in Cellular Neuroscience       0.00      0.00      0.00         9
Frontiers in Cellular and Infection Microbiology       0.36      0.27      0.31        15
                          Frontiers in Chemistry       0.65      0.48      0.55        23
                      Frontiers in Communication       0.33      1.00      0.50         1
         Frontiers in Computational Neuroscience       0.30      0.60      0.40         5
                      Frontiers in Earth Science       0.67      0.44      0.53         9
              Frontiers in Ecology and Evolution       0.25      0.18      0.21        11
                          Frontiers in Education       0.33      0.50      0.40         4
                      Frontiers in Endocrinology       0.75      0.35      0.48        17
                    Frontiers in Energy Research       0.25      0.60      0.35         5
              Frontiers in Environmental Science       0.17      0.33      0.22         3
          Frontiers in Forests and Global Change       0.33      0.67      0.44         3
                           Frontiers in Genetics       0.53      0.70      0.61        33
                 Frontiers in Human Neuroscience       0.22      0.25      0.24         8
                         Frontiers in Immunology       0.69      0.85      0.76        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.81      0.62      0.70        21
                          Frontiers in Materials       0.40      0.33      0.36         6
             Frontiers in Mechanical Engineering       0.00      0.00      0.00         2
                           Frontiers in Medicine       0.57      0.29      0.38        14
                       Frontiers in Microbiology       0.84      0.67      0.75        79
              Frontiers in Molecular Biosciences       0.09      0.17      0.12         6
             Frontiers in Molecular Neuroscience       0.14      0.12      0.13         8
                    Frontiers in Neural Circuits       0.40      1.00      0.57         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.56      0.56      0.56        25
                      Frontiers in Neurorobotics       0.40      0.67      0.50         3
                       Frontiers in Neuroscience       0.60      0.52      0.56        29
                          Frontiers in Nutrition       0.29      0.67      0.40         3
                           Frontiers in Oncology       0.72      0.77      0.74        43
                         Frontiers in Pediatrics       0.56      0.64      0.60        14
                       Frontiers in Pharmacology       0.70      0.49      0.58        57
                            Frontiers in Physics       0.29      0.45      0.36        11
                         Frontiers in Physiology       0.65      0.49      0.56        35
                      Frontiers in Plant Science       0.78      0.60      0.68        48
                         Frontiers in Psychiatry       0.71      0.79      0.75        28
                         Frontiers in Psychology       0.81      0.70      0.75        73
                      Frontiers in Public Health       0.45      0.50      0.48        10
                    Frontiers in Robotics and AI       0.60      0.60      0.60         5
                          Frontiers in Sociology       0.00      0.00      0.00         1
           Frontiers in Sports and Active Living       0.00      0.00      0.00         3
                            Frontiers in Surgery       0.20      0.33      0.25         3
           Frontiers in Sustainable Food Systems       0.33      0.25      0.29         4
              Frontiers in Synaptic Neuroscience       0.20      1.00      0.33         1
               Frontiers in Systems Neuroscience       0.25      0.33      0.29         3
                 Frontiers in Veterinary Science       0.67      0.53      0.59        15

                                        accuracy                           0.56       833
                                       macro avg       0.35      0.41      0.36       833
                                    weighted avg       0.60      0.56      0.57       833

Word2Vec

In [16]:
model = IO(filename="evaluation_keywords_word2vec",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.64; MRR: 0.5
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       0.00      0.00      0.00         6
                 Frontiers in Aging Neuroscience       0.71      0.50      0.59        10
 Frontiers in Applied Mathematics and Statistics       0.50      1.00      0.67         2
            Frontiers in Artificial Intelligence       0.00      0.00      0.00         1
       Frontiers in Astronomy and Space Sciences       0.50      1.00      0.67         1
            Frontiers in Behavioral Neuroscience       0.40      0.57      0.47         7
                           Frontiers in Big Data       0.00      0.00      0.00         1
   Frontiers in Bioengineering and Biotechnology       0.57      0.44      0.50        18
                         Frontiers in Blockchain       1.00      1.00      1.00         2
                  Frontiers in Built Environment       1.00      1.00      1.00         3
            Frontiers in Cardiovascular Medicine       0.00      0.00      0.00         4
     Frontiers in Cell and Developmental Biology       0.34      0.67      0.45        15
              Frontiers in Cellular Neuroscience       0.33      0.33      0.33         9
Frontiers in Cellular and Infection Microbiology       0.52      0.73      0.61        15
                          Frontiers in Chemistry       0.75      0.65      0.70        23
                      Frontiers in Communication       0.00      0.00      0.00         1
         Frontiers in Computational Neuroscience       0.33      0.40      0.36         5
                      Frontiers in Earth Science       1.00      0.67      0.80         9
              Frontiers in Ecology and Evolution       0.56      0.82      0.67        11
                          Frontiers in Education       0.25      0.25      0.25         4
                      Frontiers in Endocrinology       0.54      0.41      0.47        17
                    Frontiers in Energy Research       0.62      1.00      0.77         5
              Frontiers in Environmental Science       1.00      0.67      0.80         3
          Frontiers in Forests and Global Change       1.00      1.00      1.00         3
                           Frontiers in Genetics       0.61      0.58      0.59        33
                 Frontiers in Human Neuroscience       0.40      0.75      0.52         8
                         Frontiers in Immunology       0.75      0.74      0.74        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       0.92      0.57      0.71        21
                          Frontiers in Materials       0.55      1.00      0.71         6
             Frontiers in Mechanical Engineering       1.00      0.50      0.67         2
                           Frontiers in Medicine       0.30      0.43      0.35        14
                       Frontiers in Microbiology       0.86      0.68      0.76        79
              Frontiers in Molecular Biosciences       0.18      0.50      0.26         6
             Frontiers in Molecular Neuroscience       0.33      0.62      0.43         8
                    Frontiers in Neural Circuits       0.33      0.50      0.40         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.60      0.48      0.53        25
                      Frontiers in Neurorobotics       0.33      1.00      0.50         3
                       Frontiers in Neuroscience       0.53      0.28      0.36        29
                          Frontiers in Nutrition       0.29      0.67      0.40         3
                           Frontiers in Oncology       0.75      0.84      0.79        43
                         Frontiers in Pediatrics       0.71      0.71      0.71        14
                       Frontiers in Pharmacology       0.86      0.65      0.74        57
                            Frontiers in Physics       0.60      0.82      0.69        11
                         Frontiers in Physiology       0.83      0.57      0.68        35
                      Frontiers in Plant Science       0.95      0.73      0.82        48
                         Frontiers in Psychiatry       0.60      0.64      0.62        28
                         Frontiers in Psychology       0.80      0.82      0.81        73
                      Frontiers in Public Health       0.39      0.70      0.50        10
                    Frontiers in Robotics and AI       0.50      0.40      0.44         5
                          Frontiers in Sociology       0.00      0.00      0.00         1
           Frontiers in Sports and Active Living       0.50      0.33      0.40         3
                            Frontiers in Surgery       0.25      0.33      0.29         3
           Frontiers in Sustainable Food Systems       0.60      0.75      0.67         4
              Frontiers in Synaptic Neuroscience       0.00      0.00      0.00         1
               Frontiers in Systems Neuroscience       0.29      0.67      0.40         3
                 Frontiers in Veterinary Science       0.77      0.67      0.71        15

                                        accuracy                           0.64       833
                                       macro avg       0.50      0.54      0.50       833
                                    weighted avg       0.68      0.64      0.65       833

SBERT

In [17]:
model = IO(filename="evaluation_keywords_sbert",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.73; MRR: 0.56
                                                  precision    recall  f1-score   support

                       Frontiers for Young Minds       0.30      0.50      0.37         6
                 Frontiers in Aging Neuroscience       0.67      0.60      0.63        10
 Frontiers in Applied Mathematics and Statistics       0.40      1.00      0.57         2
            Frontiers in Artificial Intelligence       1.00      1.00      1.00         1
       Frontiers in Astronomy and Space Sciences       1.00      1.00      1.00         1
            Frontiers in Behavioral Neuroscience       0.58      1.00      0.74         7
                           Frontiers in Big Data       0.25      1.00      0.40         1
   Frontiers in Bioengineering and Biotechnology       0.67      0.56      0.61        18
                         Frontiers in Blockchain       1.00      1.00      1.00         2
                  Frontiers in Built Environment       1.00      0.67      0.80         3
            Frontiers in Cardiovascular Medicine       0.50      0.75      0.60         4
     Frontiers in Cell and Developmental Biology       0.48      0.73      0.58        15
              Frontiers in Cellular Neuroscience       0.57      0.44      0.50         9
Frontiers in Cellular and Infection Microbiology       0.55      0.80      0.65        15
                          Frontiers in Chemistry       0.86      0.83      0.84        23
                      Frontiers in Communication       1.00      1.00      1.00         1
         Frontiers in Computational Neuroscience       0.36      0.80      0.50         5
                      Frontiers in Earth Science       0.90      1.00      0.95         9
              Frontiers in Ecology and Evolution       0.80      0.73      0.76        11
                          Frontiers in Education       1.00      0.50      0.67         4
                      Frontiers in Endocrinology       0.62      0.47      0.53        17
                    Frontiers in Energy Research       1.00      0.60      0.75         5
              Frontiers in Environmental Science       0.67      0.67      0.67         3
          Frontiers in Forests and Global Change       0.75      1.00      0.86         3
                           Frontiers in Genetics       0.85      0.67      0.75        33
                 Frontiers in Human Neuroscience       0.46      0.75      0.57         8
                         Frontiers in Immunology       0.75      0.85      0.80        61
           Frontiers in Integrative Neuroscience       0.00      0.00      0.00         2
                     Frontiers in Marine Science       1.00      0.81      0.89        21
                          Frontiers in Materials       0.75      1.00      0.86         6
             Frontiers in Mechanical Engineering       1.00      1.00      1.00         2
                           Frontiers in Medicine       0.58      0.50      0.54        14
                       Frontiers in Microbiology       0.90      0.81      0.85        79
              Frontiers in Molecular Biosciences       0.31      0.67      0.42         6
             Frontiers in Molecular Neuroscience       0.36      0.50      0.42         8
                    Frontiers in Neural Circuits       0.50      1.00      0.67         2
                       Frontiers in Neuroanatomy       0.00      0.00      0.00         1
                   Frontiers in Neuroinformatics       0.00      0.00      0.00         1
                          Frontiers in Neurology       0.57      0.48      0.52        25
                      Frontiers in Neurorobotics       0.75      1.00      0.86         3
                       Frontiers in Neuroscience       0.53      0.31      0.39        29
                          Frontiers in Nutrition       0.33      0.67      0.44         3
                           Frontiers in Oncology       0.79      0.88      0.84        43
                         Frontiers in Pediatrics       0.82      0.64      0.72        14
                       Frontiers in Pharmacology       0.75      0.70      0.73        57
                            Frontiers in Physics       0.70      0.64      0.67        11
                         Frontiers in Physiology       0.76      0.54      0.63        35
                      Frontiers in Plant Science       0.91      0.83      0.87        48
                         Frontiers in Psychiatry       0.70      0.75      0.72        28
                         Frontiers in Psychology       0.88      0.84      0.86        73
                      Frontiers in Public Health       0.58      0.70      0.64        10
                    Frontiers in Robotics and AI       1.00      0.60      0.75         5
                          Frontiers in Sociology       0.00      0.00      0.00         1
           Frontiers in Sports and Active Living       0.60      1.00      0.75         3
                            Frontiers in Surgery       0.50      0.67      0.57         3
           Frontiers in Sustainable Food Systems       0.75      0.75      0.75         4
              Frontiers in Synaptic Neuroscience       1.00      1.00      1.00         1
               Frontiers in Systems Neuroscience       0.33      0.67      0.44         3
                 Frontiers in Veterinary Science       0.75      0.80      0.77        15

                                        accuracy                           0.73       833
                                       macro avg       0.65      0.71      0.66       833
                                    weighted avg       0.75      0.73      0.73       833

Conclusion

The best results are obtained with SBERT document embeddings computed on the full text (accuracy 0.83, MRR 0.71). However, its performance is quite similar to that of the TFIDF approach. The method to use in a production environment depends on the application. If accuracy is the main goal, SBERT is the right choice, although it is computationally expensive. TFIDF is faster, but it needs a large amount of RAM to load its large vectors.

Another interesting result is the comparison between TFIDF and Word2Vec. As previously described by Meijer et al. (Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF, 2021), I observe that TFIDF performs better than Word2Vec when the entire document is used, while the result is inverted when only keywords are used (Word2Vec reaches an accuracy of 0.64 against 0.56 for TFIDF). What is unexpected is that SBERT outperforms both of the other methods in all cases.
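The two figures printed for each model (accuracy and MRR) can be illustrated with a minimal sketch. It assumes that every article comes with a list of journals ranked from most to least likely; the function names and the toy data are hypothetical, and the code follows the standard definitions of mean reciprocal rank and top-k accuracy, not necessarily the exact implementation behind the numbers above.

```python
from typing import Sequence


def reciprocal_rank(ranked_labels: Sequence[str], true_label: str) -> float:
    """Return 1/rank of the true label, or 0.0 if it is absent from the ranking."""
    for position, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], true_labels: Sequence[str]
) -> float:
    """Average the reciprocal rank of the true label over all articles."""
    if not rankings or len(rankings) != len(true_labels):
        raise ValueError("rankings and true_labels must be non-empty and aligned")
    total = sum(reciprocal_rank(r, t) for r, t in zip(rankings, true_labels))
    return total / len(rankings)


def top_k_accuracy(
    rankings: Sequence[Sequence[str]], true_labels: Sequence[str], k: int = 3
) -> float:
    """Fraction of articles whose true label appears among the first k suggestions."""
    if not rankings or len(rankings) != len(true_labels):
        raise ValueError("rankings and true_labels must be non-empty and aligned")
    hits = sum(t in list(r)[:k] for r, t in zip(rankings, true_labels))
    return hits / len(rankings)


# Toy data (hypothetical): three articles that all belong to the same journal.
toy_rankings = [
    ["Frontiers in Oncology", "Frontiers in Medicine", "Frontiers in Genetics"],
    ["Frontiers in Medicine", "Frontiers in Oncology", "Frontiers in Genetics"],
    ["Frontiers in Genetics", "Frontiers in Medicine", "Frontiers in Oncology"],
]
toy_truth = ["Frontiers in Oncology"] * 3

toy_mrr = mean_reciprocal_rank(toy_rankings, toy_truth)
toy_top1 = top_k_accuracy(toy_rankings, toy_truth, k=1)
toy_top3 = top_k_accuracy(toy_rankings, toy_truth, k=3)
```

Because the feature shows three suggestions to the author, MRR and top-3 accuracy reward a model that places the correct journal anywhere in the shortlist, with MRR giving more credit the higher it is ranked.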

Deployment and application

Finally, the best model is deployed as a REST API built with FastAPI and accessed through a very simple web interface built with Streamlit.
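Whatever framework serves the model, the response ultimately has to contain the three most relevant journals. A minimal, framework-independent sketch of that selection step is given here; the `top_k_journals` helper and the per-journal score dictionary are assumptions made for illustration, not the deployed code.

```python
from typing import List, Mapping


def top_k_journals(scores: Mapping[str, float], k: int = 3) -> List[str]:
    """Return the k journals with the highest score, best first.

    Ties are broken alphabetically so that the output is deterministic.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [journal for journal, _ in ranked[:k]]


# Hypothetical scores returned by a classifier for one manuscript.
example_scores = {
    "Frontiers in Neurology": 0.41,
    "Frontiers in Neuroscience": 0.33,
    "Frontiers in Psychiatry": 0.12,
    "Frontiers in Oncology": 0.02,
}
suggestions = top_k_journals(example_scores, k=3)
```

Keeping this step as a plain function makes it easy to unit test independently of the API and of the web interface.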

The model chosen is SBERT, because it performs better than the others. However, it is also the slowest. For production, the target environment should be evaluated to decide between deploying SBERT on a GPU server and moving to a faster model (such as TFIDF).
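One way to ground that decision is to measure the inference latency of each candidate on the target hardware. The helper below is a minimal sketch using only the standard library; `measure_latency` and the dummy predictor are hypothetical, and any callable that maps a text to a prediction (SBERT-based or TFIDF-based) could be passed in its place.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(
    predict: Callable[[str], object], texts: Sequence[str], repeats: int = 3
) -> Dict[str, float]:
    """Time single-text predictions and summarise the latency in seconds."""
    if not texts or repeats < 1:
        raise ValueError("texts must be non-empty and repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            predict(text)
            timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def dummy_predict(text: str) -> int:
    """Stand-in for a real model: counts the tokens in the text."""
    return len(text.split())


latency = measure_latency(dummy_predict, ["first sample text", "second sample"])
```

Running the same measurement for both models, on CPU and on GPU, gives comparable numbers to weigh against the accuracy gap reported above.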